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ABSTRACT 

This  paper  discusses  the  ILLIAC  IV  computer  and  its  application 
to  seismic  signal-processing  problems.   ILLIAC  IV  "will  be  operational  at 
the  University  of  Illinois'  Computer  Science  Department  in  1970. 

ILLIAC  IV  is  an  array  of  256  coupled  high-speed  digital  computers; 
as  such,  it  has  the  capacity  to  execute  algorithms  256  times  faster  than 
present-day  computers.   Achievement  of  large  speed-up  factors,  however,  is 
dependent  upon  "structuring"  an  algorithm  into  groups  of  (nominally,  256) 
identical  computations.   As  an  aid  to  the  programmer,  the  Tranquil  language 
is  being  designed  with  certain  common  statements,  such  as  matrix  operations, 
prestructured  to  provide  for  the  parallel  execution  of  code  on  many  pieces 
of  data.   These,  as  well  as  other,  aspects  of  the  ILLIAC  IV  System  and  the 
Tranquil  language  are  described  in  the  first  part  of  this  paper. 

The  second  part  of  the  paper  discusses  the  structuring  of  common 
signal-processing  algorithms,  such  as  beam  forming,  convolution,  and  fast 
Fourier  transform,  for  ILLIAC  IV  parallel  computation.   It  is  shown  that, 
in  a  great  variety  of  situations,  the  full  speed-up  factor,  256,  can  be 
obtained. 


* 

ILLIAC  IV  is  being  fabricated  by  the  Burroughs  Corporation  for  delivery  at 

the  University  of  Illinois  in  early  1970- 
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THE  ILLIAC  IV  COMPUTER 

System  Organization 

ILLIAC  IV  is  an  array  of  256  coupled  computers  driven  by  instruc- 
tions from  a  common  control  unit.    Each  of  the  256  processing  elements 
(PEs)  has  2048  words  of  6k   bit  memory  with  a  2^0  nanosecond  cycle  time. 
Each  PE  is  capable  of  6k   bit  floating  point  multiplication  in  U00  nano- 
seconds and  addition  in  2^0  nanoseconds.   Thirty-two  bit  floating  point 
operations  and  8  bit  fixed  point  operations  are  also  available.   The  PE 
instruction  set  is  similar  to  that  of  conventional  machines,  with  two 
exceptions.   First,  the  PEs  are  capable  of  communicating  data  to  four  neigh- 
boring PEs  by  means  of  routing  instructions.   Second,  the  PEs  are  able  to 
set  their  own  mode  registers  to  effectively  disable  or  enable  themselves. 

Figure  1  shows  6k   PEs,  each  having  three  arithmetic  registers 
(A,  B,  and  C)  and  one  protected  programmer  register  (S).   The  registers, 
words,  and  paths  in  Figure  1  are  all  6k   bits  wide,  except  the  PE  index 
registers  (XR),  mode  registers,  and  as  noted.   The  mode  register  may  be 
regarded  as  one  bit  which  may  be  used  to  block  the  participation  of  its  PE 
in  any  action.   The  routing  registers  are  shown  connected  to  neighbors 
at  distances  ±1  and  18;  similar  end  around  connections  are  provided  at 
1,  6k,    etc.   Programs  and  data  are  stored  in  PE  memory.   Instructions  are 
fetched  by  the  control  unit  (CU)  as  required. 

Figure  1  also  shows  the  essential  registers  and  paths  in  the  CU 
and  their  relations  to  the  PEs.   Instructions  are  decoded  and  control 


For  a  more  detailed  discussion  of  the  ILLIAC  IV  hardware  and 
software,  see  [1]  and  [2]. 
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signals  are  sent  to  the  PE  array  from  the  control  unit.   Some  instructions 
are  executed  directly  in  the  CU;  e.g.,  the  loading  of  CU  accumulator 
registers  (CARs)  with  program  literals.   Operand  addresses  may  be  indexed 
once  in  the  CU  and  again  separately  in  each  PE.   It  is  possible  to  load  the 
local  data  buffer  (6k   words  of  6k   bits  each)  and  CARs  from  PE  memory. 
Local  data  buffer  registers  and  CARs  may  be  loaded  from  each  other.   A 
broadcast  instruction  allows  the  contents  of  a  CAR  to  be  transmitted  simul- 
taneously to  all  PEs.   It  is  often  convenient  to  manipulate  all  PE  mode 
bits  or  a  number  from  one  PE  in  a  CAR.   For  this  purpose,  the  broadcast  path 
is  bidirectional. 

ILLIAC  IV  has  k   control  units,  each  of  which  may  drive  6k   PEs 
independently.   In  the  united  configuration,  all  256  PEs  are  effectively 
driven  by  one  CU  and  routing  proceeds  across  PE  boundaries.   This  allows 
some  flexibility  in  fitting  problems  to  the  array. 

The  complete  ILLIAC  IV  system  (Figure  2)  includes  a  Burroughs 

6500  which  performs  input/ output  operations,  compilation,  and  contains  an 

9 

operating  system  to  control  ILLIAC  IV.   Also,  there  is  a  10  bit,  head  per 

track  disk  with  a  kO   ms.  rotation  speed  and  an  effective  transfer  rate  of 
lCr  bits/second.   This  allows  the  loading  of  the  32  X  10  bit  ILLIAC  IV 
memory  from  the  disk  in  32  ms.   The  input/ output  controller  (IOC)  contains 
a  queuer  for  disk  requests  and  a  2-*-°  bit  buffer  memory  to  smooth  transmissions 
to  and  from  the  slower  B-65OO  memory. 

Programming  ILLIAC  IV 

It  is  apparent  that  ILLIAC  IV  may  be  suited  to  the  execution  of 
algorithms  defined  on  arrays  of  data  (e.g.,  matrix  operations  and  certain 
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common  signal -processing  algorithms).   However,  there  are  two  major  diffi- 
culties in  using  such  a  machine.   One  is  that  an  algorithm  must  be  devised 
which  allows  identical  computations  to  be  performed  simultaneously  on  a 
large  number  of  operands.   The  other  is  closely  related  to  the  first:   The 
data  must  be  so  placed  in  the  ILLIAC  IV  memory  that,  over  the  course  of  the 
calculation,  few  PEs  are  disabled.   In  other  words,  the  algorithm  and  the 
data  storage  allocation  must  be  mutually  designed  for  highly  parallel 
computation.   For  this  reason  ILLIAC  IV  codes  are  expressed  as  two  parts: 
a  storage  allocation  block  followed  by  a  computational  algorithm  block. 
These  codes  may  be  written  in  a  high  level  language  called  Tranquil  which 
is  described  below. 

Storage  Allocation 

In  some  cases  data-array  elements  are  mapped  directly  into  PE 

memory  locations  (i.e.,  element  a. .  goes  to  the  i   location  in  the  j   PE). 

In  other  cases  this  may  be  unsatisfactory.   For  example,  if  it  is  desirable 

to  have  parallel  access  to  columns  of  the  array  as  well  as  rows,  one  may 

use  a  skewed  storage  scheme  such  as:   a. .  goes  into  the  i   row  of  the  PE 

10 

whose  number  is  (i+j)  mod  n,  where  n  is  the  number  of  PEs.   This  is 

th 
illustrated  in  Figure  3(a).   Now,  for  example,  the  0   column  of  the  A 

matrix  may  be  accessed  by  setting  the  XR  of  the  i   PE  to  i. 

Tranquil 

The  Tranquil  language  is  similar  to  existing  languages  for 
serial  machines  except  that  certain  statements  provide  for  the  parallel 
execution  of  code  on  many  pieces  of  data.   Operations  on  entire  arrays  may 
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be  performed  by  simply  declaring  arguments  to  be  arrays.   Thus  if  A,  B, 
and  C  are  declared  to  be  m  x  m  matrices, 

A  ^B*C 
will  be  compiled  as  a  matrix  multiplication  code,  assigning  the  product  to 
matrix  A. 

Index  sets  may  be  used  to  refer  to  particular  elements  of  an 
array  or  to  control  a  block  of  code.   Tranquil  allows  two  kinds  of  control 
statements--SEQ  and  SIM.   SEQ  is  analogous  to  the  standard  sequential  con- 
trol statement  of  the  FORTRAN  DO  or  ALGOL  FOR  variety.   SIM,  however,  allows 
simultaneous  operation  on  all  elements  of  an  index  set.   For  example,  if 
we  define  II,  J J,  and  KK  to  be  index  sets  as  follows: 

II  «-<  0,  2,  ...,  2n> 
JJ  «-  <  1,  3,  •  •  • ,  2n+l> 
KK  <-  <  9,    •••,   2n+l> 
then 

FOR  (I, J)  SIM  (II*JJ)  DO 
BEGIN 

C(I,J)  «-o. 
FOR  K  SEQ  (KK) 

C  (I, J)  *-C(l,j)  +  A(I,K)  *  B(K,J); 
END 
will  cause  the  matrix  C  to  have  the  elements  in  the  intersection  of  even 
rows  and  odd  columns  set  to  the  inner  products  of  appropriate  rows  and 
columns  of  A  and  B. 

Tranquil  also  has  conditional  expressions  which  test  if  any  or 
all  elements  of  an  array  meet  some  Boolean  condition. 
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Tranquil  will  provide  a  number  of  canned  storage  allocation 
schemes  for  standard  geometries  plus  some  slicing  algorithms  for  arbitrary 
shapes.   The  user  will  also  have  the  prerogative  of  writing  his  own 
allocation  map.   (The  Tranquil  storage  allocation  block  provides  this  mapping 
from  symbolic  notation  into  PE  locations.)   The  Tranquil  translator  uses 
this  storage  allocation  information  when  translating  the  Tranquil  algorithm 
into  ILLIAC  IV  machine  language.   In  particular.,  routing,  indexing  and  mode 
operations  will  be  generated  from  the  storage  allocation  information.   An 
example  of  a  Tranquil  storage  allocation  statement  is  shown  in  Figure  3(b). 
This  statement  defines  the  skewed  storage  scheme  shown  in  Figure  3(a);  the 
skewing  into  memory  is  done  by  the  compiler. 

EXECUTION  OF  SIGNAL  PROCESSING  ALGORITHMS 
Digital  Filter  Design 

Several  multichannel  filter  design  procedures  [3>^]  involve  matrix 
operations --primarily  addition,  multiplication;,  transposition  and  inversion. 
As  explained  previously,  these  operations  will  be  a  part  of  the  ILLIAC  IV 
software.   Thus,  if  r,  a,  and 'a  are  matrices  and  T  denotes  transposition, 
expressions  such  as 

(3)  T     (3)  T    (3) 

which  appears  in  the  Robinson  synthesis  [5]*  will  be  compiled  and  executed 
with  high  efficiency.   If,  for  example,  the  problem  involves  an  array 
consisting  of  8  subarrays,  each  of  32  sensors,  the  design  procedure  may  be 
implemented  simultaneously  for  all  eight  subarrays.   The  problem  of 
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computing  correlation  matrices  (e.g.,  the  r's  in  the  above  formula)  from 
sampled  data  is  also  equivalent  to  matrix  multiplication. 

The  following  computation,  which  involves  approximately  3  X  10 
multiplications  of  25  X  25  matrices  is  estimated  to  take  about  10 
seconds:   Given  an  array  consisting  of  10  subarrays  of  25  sensors  each, 
use  the  Robinson  synthesis  technique  to  design  for  each  subarray  a  100 
point  least  squares  filter. 

Beam  Forming 

Figure  k   illustrates  ILLIAC  IV s  capability  to  form  beams  for  a 
large  number  of  sensors.   The  rectangular  structures  represent  the  computer- - 
its  256  PEs  stretching  horizontally  across  the  figure  and  PE  memory  extend- 
ing downward.   The  circles  represent  the  sensors  and  the  incoming  signal  is 
labeled  by  the  letter  "S"- -superscripts  denote  time-values;  subscripts 
indicate  a  particular  sensor.   Sensors  are  "paired"  with  PEs;  that  is,  each 
sensor  "feeds"  a  particular  PE.   If,  for  the  moment,  attention  is  restricted 
to  a  single  straight  summation  beam,  then  what  is  desired  is  the  vector - 
element  sum, 

si1  +  s21+  •••  +  s256i  (1) 

of  the  256-element  vector  stretching  across  the  A  registers  of  the  PEs. 
Upon  output  of  the  vector -element  sum  of  time  step  i,  the  data  of  time 
step  i  +  1  can  be  presented  and  the  process  continued.   This  process  uses 
the  vector -element  summation  algorithm  illustrated  in  Figure  5  and  achieves 
a  speed-up  of  32:1  over  serial  operation.   Figure  6  exhibits  a  Tranquil 
program  for  this  procedure.   The  program  will  be  compiled  as  the  series  of 


6  - 


routes  and  adds  illustrated  in  Figure  5.   After  execution,  the  vector -element 

2 

sum  (l)  is  found  in  all  PEs. 

It  should  be  realized,  however,  that  even  though  a  speed-up  of 
32:1  over  serial  operation  has  been  achieved,  ILLIAC  IV  is  operating  at  a 
low  12.5$  efficiency  since  at  6k   bit  precision  the  maximum  speed-up  is 
256:1.   In  view  of  this,  the  question  immediately  arises:   Can  the  summa- 
tions of  successive  time-steps  be  interleaved  to  produce  a  more  efficient 
algorithm?   It  turns  out  that  nearly  100$  efficient  operation  can  be  achieved 
by  any  of  several  interleaving  schemes.   One  scheme  [6]  has  a  structure 
essentially  identical  to  the  method  (Figure  9)  used  to  execute  the  Cooley- 
Tukey  algorithm. 

ILLIAC  IV  can  efficiently  form  beams  for  arrays  of  more  than  256 
sensors.   For  example,  to  form  beams  for  512  sensors,  the  data  is  "folded 
over"  so  that  sensors  257  through  512  also  feed  PEs  1  through  256, 
respectively.   Thus,  the  first  ILLIAC  IV  addition, 

«  i         g  i  «   i 

bl         b2  b256 

+  s25?  ,   +  s258  ,       +  s512   , 

is  completely  parallel  even  without  interleaving.   For  an  array  of  5000 
sensors,  the  data  is  "folded  over"  until  the  number  of  sensors  is  depleted 
(20  times)  and  19  additions  would  be  performed  in  parallel  before  a  vector- 
sum  technique  is  implemented.   (Note  that  some  efficiency  is  lost  because 
5000  is  not  an  integer  multiple  of  256. )   Fifty  subarrays,  each  consisting 


2 
To  avoid  a  cluttered  appearance,  Figure  5  shows  only  the 

operations  involved  in  obtaining  the  particular  sum  which  ends  up  in  PE  1. 
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of  100  sensors,  can  be  processed  in  much  the  same  way  as  a  single 

5000  sensor  array  [6] .  At  a  seismic  sampling  rate  of  20  per  second, 

the  maximum  number  of  sensors  that  could  be  handled  in  a  real-time  (single) 

beam  forming  situation  is  50  million.   If  the  sensors  are  used  in  a  multiple 

beam- forming  situation  (see  below),  the  number  50  million  is  reduced 

accordingly. 

The  more  general  problem  of  delay-and-sum  beam  forming  involving 
many  different  beams  is  illustrated  in  Figure  J.      The  figure  shows  a  portion 
of  memory  (loc  1  through  loc  h)   large  enough  to  accommodate  the  largest 
delay.   The  beam  patterns  shown  define  the  beams  for  all  time.   The  inde- 
pendent access  feature  of  PE  memory  (i.e.,  the  PE  index  register)  enables 
the  components  of  each  beam  to  be  accessed  simultaneously,  and  the  sum  can 
be  formed  as  if  it  were  a  straight  sum  beam.   After  all  beams  have  been 
processed  for  t  =  1,  new  data  (in  Figure  "J:      S  )  may  be  entered  into  memory 
(in  Figure  rJ:      loc  1  or  loc  5),    and  the  process  is  repeated. 

Convolution 

The  convolution  of  a  sampled  time-series  with  predetermined 
filter  weights  can  become  time-consuming  if  (l)  the  problem  involves 
multichannel  filters,  or  (2)  the  length  of  the  filter  is  very  large. 
These  cases  are  considered  separately. 

(l)  Suppose  samples  are  fed  to  each  PE  from  a  particular  sensor 
(as  in  Figure  k) .      The  parallelism  for  multichannel  convolution  is  evident-- 
each  PE  performs  a  single  channel  convolution  and  the  results  are  summed. 
For  example,  a  50-point  convolutional  filter  would  involve  a  standard 
serial  convolution  algorithm  which  is  executed  simultaneously  in  all  PEs 


-  8  - 


followed  (after  every  50  intra-PE  operations)  by  a  vector-element  summation 
across  the  entire  PE  array  (in  the  case  of  one  256- channel  filter)  or  sums 
across  sections  of  the  PE  array  (in  the  case  of  several  smaller  multichannel 
filters).   These  vector-element  summations  may  be  interleaved  if  desired. 

(2)  Less  transparent,  but  still  easily  implemented,  is  the  case 
of  a  single  long  convolution.   Consider  the  256-point  single  channel  con- 
volution illustrated  in  Figure  8.   Denoting  filter  weights  by  W  and  signal 
samples  by  S,  the  desired  successive  filter  outputs  are 

(sV  +  S2*2  +  . . .  +  S256^56),    (SV  +  SV  +  . . .  +  S25V56),   etc. 
At  t  =  0  the  weights  are  spread  across  register  S.   At  t  =  1,  S  is  received 
in  the  CU,  broadcast  multiplied  into  the  array  of  weights,  and  this  result 
added  to  loc  1.   The  weights  now  (actually,  while  the  addition  is  in 

progress)  are  routed  leftward  one  PE.   (See  t  =  1  in  Figure  8.)  At  t  =  2, 

2 

S  is  received,  multiplied  into  the  weights,  and  the  partial  sum  in  loc  1 

is  updated.   The  weights  move  leftward,  and  so  on.   At  t  =  256  the  first 

3 
genuine  filter  output  is  found  in  PE  1.   Thereafter,  each  iteration  produces 

a  new  filter  output  which  is  found  in  the  PE  leftward  of  the  previous  output. 

Thus,  the  second  output, 

sV  +  sV  +  ...  +  s25V56, 

will  be  found  in  PE  256.   After  each  result  is  obtained,  loc  1  in  the  PE  it 
came  from  is  reset  to  zero.   It  should  be  noted  that  this  scheme  is  not 
restricted  to  256  samples,  and  that  the  procedure  achieves  a  speed-up  factor 
of  approximately  256  over  serial  operation. 


3 
Dummy  outputs  were  produced  at  times  1  through  255  as  denoted  by 

the  O's  in  loc  1.   This  is  done  so  that  the  same  algorithm  suffices  for  all 

time. 
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Cooley-Tukey  Algorithm 

The  Cooley-Tukey  Algorithm  [7]  for  the'  case  N=2  ,  where  p  is  an 
integer,  is  well  suited  for  parallel  computation.   An  ILLIAC  IV  Cooley-Tukey 
program  can  be  conveniently  divided  into  three  sections:   (i)  W-generation, 
(ii)  iteration,  and  (iii)  unscrambling.   W-generation  refers  to  the  task  of 
computing  the  complex  constants  (W's)  needed  to  implement  the  algorithm.   If 
the  algorithm  is  used  successively  for  different  sets  of  data,  the  W's 
remain  unchanged- -that  is,  they  depend  only  on  p.   The  iteration  scheme 
which  will  be  presented  leaves  the  Fourier  components  in  an  unnatural  order, 
and  it  is  necessary  to  unscramble  them.   The  unscrambling  procedure  (iii) 
takes  less  than  10$  the  time  of  (ii).  Thus,  in  a  real-time  situation,  it 
is  (ii)  which  determines  the  computation  time.  For  these  reasons,  and  since 
it  is  "iteration"  which  computes  the  finite  Fourier  transform  of  the  N 
samples,  this  paper  will  present  a  detailed  description  only  of  (ii).   The 
schemes  which  implement  (i)  and  (iii)  are  similar  to  that  for  implementing 
(ii)  and  all  are  highly  parallel  [8]. 

A  good  feeling  for  what  (ii)  accomplishes  can  be  obtained  by 
examining  the  Cooley-Tukey  algorithm  for  a  relatively  simple  case  (Figure  9)« 
The  case  illustrated  involves  16  data  points  (A's)  and  8  PEs.   The  figure 
shows  12  "operations".   Original  storage  appears  in  (l)  ;  Fourier  samples 
are  the  A,  (i)  appearing  (scrambled)  in  Cl2\      .'  The  caption  between  the 
boxes  describes  the  "operation"  which  transforms  the  box  above  the  caption 
into  the  box  below  the  caption.   Note  that  for  ease  in  portrayal  A,  (0)  is 

written  in  place  of  A(0)  +  A(8)W  and  A_(8)  is  written  in  place  of 

0  h 

A(0)  -  A(8)W  ,  etc.   The  switch  ,  multiply,  add,  subtract;  switch,  multiply 


T 

Note  that  the  "switch"  operation  is  of  variable  "length" 
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add,  subtract  structure  of  the  program  is  dictated  by  the  inherent  structure 
of  the  algorithm.   The  algorithm  for  l6  data  points  involves  four  iterations. 
Of  the  four  iterations,  only  the  first  does  not  entail  routing  (in  Figure  9: 
"switch").   The  algorithm  for  32  data  points  would  involve  five  iterations, 
the  data  would  be  folded  over  k   deep,  and  both  the  first  and  second  iter:  : 
would  not  entail  routing  [8].   The  algorithm  for  N  =  4096  data  points  (6k 
bits;  parts  ii  and  iii  only)  is  executed  in  about  .6  millisecond. 
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Fig.  9.  Cooley-Tukey  algorithm. 
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